SI649-24 Fall -> Altair I¶

Overview¶

We're going to re-create some of the visualizations we made in Tableau for the article “The Dollar-And-Cents Case Against Hollywood’s Exclusion of Women”, this time using Altair. We'll be teaching you different pieces of Altair over the next few weeks, so we'll focus on just a few visualizations this time:

  1. Replicate 1 visualization from the original article
  2. Implement 2 new visualizations according to our specifications

For this lab, we have done all of the necessary data transformation for you. You do not need to modify any DataFrame. You only need to write Altair code.

Lab Instructions (read the full version on the handout of the previous lab)¶

  • Save, rename, and submit the ipynb file (use your username in the name).
  • Run every cell (do Runtime -> Restart and run all to make sure you have a clean working version), print to pdf, submit the pdf file.
  • For each visualization, we will ask you to write down a "Grammar of Graphics" plan first (basically a description of what you'll code).
  • If you end up stuck, show us your work by including links (URLs) that you have searched for. You'll get partial credit for showing your work in progress.
  • There are many bonus point opportunities in this lab.

We encourage you to go through the Altair tutorials before next week:

  • UW Course
  • Altair tutorial

Resources¶

  • Altair Documentation
  • Markdown Cheatsheet
  • Pandas DataFrame Introduction
In [48]:
# imports we will use
import altair as alt
import pandas as pd
from collections import defaultdict
alt.renderers.enable('html')  # run this line if you are running Jupyter Notebook
Out[48]:
RendererRegistry.enable('html')
In [49]:
# load data and perform basic data processing 
# get the CSV
datasetURL="https://raw.githubusercontent.com/dallascard/SI649_public/master/altair_hw1/movies_individual_task.csv" 
movieDF=pd.read_csv(datasetURL, encoding="latin-1")

# fix the result column, rename the values, and combine "dubious" with "ok" as "Passes Bechdel Test"
movieDF['test_result'] = movieDF['clean_test'].map({
    "ok":"Passes Bechdel Test",
    "men":'Women only talk about men',
    "notalk":"Women don't talk to each other",
    "nowomen":"Fewer than two women",
    "dubious":"Passes Bechdel Test"
})

# fix the location column to combine US and Canada
locationDict = defaultdict(lambda: 'International')
locationDict["United States"]="U.S. and Canada"
locationDict["Canada"]="U.S. and Canada"
movieDF["country_binary"]=movieDF["country"].map(locationDict)

# calculate ROI (Return on Investment), both domestic (US and Canada) and international
movieDF["roi_dom"]=movieDF["domgross_2013$"]/movieDF["budget_2013$"]
movieDF["int_only_gross"]=movieDF["intgross_2013$"]-movieDF["domgross_2013$"]
movieDF["roi_int"]=movieDF["int_only_gross"]/movieDF["budget_2013$"]

# drop the columns we won't need
movieDF=movieDF.drop(columns=["Unnamed: 0","test","budget","domgross","intgross","code","period code","decade code","director","imdb"])

# Make a copy of the data frame that excludes movies from before 1990
movieDF_since_1990=movieDF[movieDF.year>1989]

# take a look at the new dataset
movieDF_since_1990.sample(3)
Out[49]:
     | year | title | clean_test | binary | budget_2013$ | domgross_2013$ | intgross_2013$ | director_gender | genre | rating | country | language | test_result | country_binary | roi_dom | int_only_gross | roi_int
1367 | 1998 | Rush Hour | ok | PASS | 50019631 | 201774709.0 | 350566156.0 | male | Action | 7.0 | United States | English | Passes Bechdel Test | U.S. and Canada | 4.033910 | 148791447.0 | 2.974661
1096 | 2002 | Gerry | nowomen | FAIL | 9066255 | 329860.0 | 329860.0 | male | Adventure | 6.2 | United States | English | Fewer than two women | U.S. and Canada | 0.036383 | 0.0 | 0.000000
988 | 2004 | The Butterfly Effect | men | FAIL | 16031507 | 71432302.0 | 118444284.0 | male | Sci-Fi | 7.7 | United States | English | Women only talk about men | U.S. and Canada | 4.455745 | 47011982.0 | 2.932474
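
To make the derived columns concrete, here is a small sketch that redoes the arithmetic for the Rush Hour row from the sample output, using only the standard library. The `row` dictionary is a hand-copied stand-in for one DataFrame row and is not part of the lab code. The comment about `intgross_2013$` including the domestic gross is an inference from the subtraction in the cell above.

```python
from collections import defaultdict

# One row copied by hand from the sample output (Rush Hour, 1998).
row = {
    "budget_2013$": 50019631,
    "domgross_2013$": 201774709.0,
    "intgross_2013$": 350566156.0,
    "country": "United States",
}

# Domestic ROI: dollars earned in the U.S. and Canada per dollar of budget.
roi_dom = row["domgross_2013$"] / row["budget_2013$"]

# The lab treats intgross as including the domestic gross (an inference
# from its code), so subtract it to get the international-only gross.
int_only_gross = row["intgross_2013$"] - row["domgross_2013$"]
roi_int = int_only_gross / row["budget_2013$"]

# Any country without an explicit entry falls back to "International".
location = defaultdict(lambda: "International")
location["United States"] = "U.S. and Canada"
location["Canada"] = "U.S. and Canada"

print(round(roi_dom, 6), round(roi_int, 6), location[row["country"]])
print(location["France"])  # no explicit entry, so the default is returned
```

The rounded values should agree with the `roi_dom` and `roi_int` columns printed for that row.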

Part 1: Recreate this visualization¶

Step 1: Write down your plan for each part of this chart:¶

For each chart, we are asking you to write a Grammar of Graphics plan for the chart. This involves writing down 1) the dataset you will use; 2) the type of mark you will use (e.g., bar, line, point, etc.), and 3) for each visual channel (e.g., position, color, etc.), the corresponding variable name (e.g., year, ROI, etc.) and data type (i.e., ordinal, nominal, or quantitative). Please use the following format:

  • Data Name: dataset
  • Mark type: mark type
  • Encoding Specification:
    • channel:variable:datatype
    • channel:variable:datatype
    • ...

Hint: you should provide encoding specifications for both x and y, using the format channel:variable:datatype. For example, if we wanted to encode a nominal variable called "movietype" as the color, we would write:

  • color : movietype : nominal
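
A plan line in this format maps almost directly onto Altair's encoding shorthand, where the data type becomes a one-letter suffix (Q, O, N, or T). The sketch below shows that correspondence; `plan_to_shorthand` is a hypothetical helper written for illustration and is not part of Altair.

```python
# Map the data-type words used in the plan to Altair's one-letter type codes.
TYPE_CODES = {"quantitative": "Q", "ordinal": "O", "nominal": "N", "temporal": "T"}


def plan_to_shorthand(plan_line):
    """Turn 'channel : variable : datatype' into (channel, 'variable:CODE').

    Hypothetical helper for illustration; it is not an Altair function.
    """
    channel, variable, datatype = (part.strip() for part in plan_line.split(":"))
    return channel, f"{variable}:{TYPE_CODES[datatype.lower()]}"


print(plan_to_shorthand("color : movietype : nominal"))  # ('color', 'movietype:N')
```

The returned pair corresponds to writing `.encode(color='movietype:N')` in a chart.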

*** Edit this cell to be your visualization plan (required) ***:¶

Left Chart:

  • Data Name: movieDF_since_1990
  • mark type: bar
  • Encoding Specification:
    • x:median(roi_dom):quantitative (dollars earned for every dollar spent)
    • y:test_result:nominal (movie categories)
    • color: Fixed blue to represent U.S. and Canada

Right Chart:

  • Data Name: movieDF_since_1990
  • mark type: bar
  • Encoding Specification:
    • x:median(roi_int):quantitative (dollars earned for every dollar spent)
    • y:test_result:nominal (movie categories)
    • color: Fixed orange to represent international

Compound Method (how to join these charts together?): Use horizontal concatenation (alt.hconcat), so that the U.S./Canada and International charts are placed side by side for easy comparison

Step 2: Create your chart.¶

Please use the checkpoints below to work through the problem step-by-step. You can search for the keyword "TODO" to locate cells that need your edits

Visualization 1 Checkpoints¶

checkpoint 1: create the left chart as a basic bar chart (Domestic ROI by Bechdel test category)¶

  • Specify the correct mark
  • Use the correct x and y encoding
  • Plot the right data (hint: make sure you examine the data frame and use the correct columns)

Your chart will look like:

In [50]:
left_chart = alt.Chart(movieDF_since_1990).mark_bar().encode(
    x=alt.X('median(roi_dom):Q', title='Median of roi_dom'),
    y=alt.Y('test_result:N', sort=['Fewer than two women','Passes Bechdel Test', "Women don't talk to each other", 'Women only talk about men'], title='test_result')
)

left_chart
Out[50]:

checkpoint 2: sort the categories on the y-axis¶

  • completed checkpoint1
  • applied the correct sort order to the values on the y-axis (i.e., from top to bottom, the order of the bars is "Passes Bechdel Test", "Women only talk about men", "Women don't talk to each other", "Fewer than two women")

Your chart will look like:

Hint: Sort

In [51]:
left_chart = alt.Chart(movieDF_since_1990).mark_bar().encode(
    x=alt.X('median(roi_dom):Q', title='Median of roi_dom'),
    y=alt.Y('test_result:N', sort=['Passes Bechdel Test', 'Women only talk about men', "Women don't talk to each other", 'Fewer than two women'], title='test_result')
)

left_chart
Out[51]:

checkpoint 3: Add a chart title, and remove axis labels and x-axis¶

  • completed checkpoint2
  • add a chart title
  • remove the x and y-axis labels
  • remove the x-axis tick marks

Your chart will look like:

Hint: Axis

In [52]:
left_chart = alt.Chart(movieDF_since_1990).mark_bar().encode(
    x=alt.X('median(roi_dom):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=['Passes Bechdel Test', 'Women only talk about men', "Women don't talk to each other", 'Fewer than two women'], title=None)
).properties(
    title='U.S. and Canada'
)

left_chart
Out[52]:

checkpoint 4: Reshape the plot¶

  • completed checkpoint 3
  • Reshape the plot to have both width and height equal to 100

Your chart will look like:

Hint: set the width and height properties of the chart

In [53]:
left_chart = alt.Chart(movieDF_since_1990).mark_bar().encode(
    x=alt.X('median(roi_dom):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=['Passes Bechdel Test', 'Women only talk about men', "Women don't talk to each other", 'Fewer than two women'], title=None)
).properties(
    title='U.S. and Canada',
    width=100,
    height=100
)

left_chart
Out[53]:

checkpoint 5: Add a text layer with the numbers for each bar¶

  • completed checkpoint 4
  • add the numbers for each bar with correct formatting (two decimal places)

Your chart will look like:

Hint 1: In Altair you can overlay two charts on top of each other using the "+" notation (e.g., chart1 + chart2)

Hint 2: You can create a text layer that inherits everything from the base layer by using base.mark_text().encode(text="..."), where base is the name of the base bar chart, and the "..." is the data to show as text

Hint 3: Use the "dx" property of mark_text() to nudge the text left or right (see https://altair-viz.github.io/gallery/bar_chart_with_labels.html)

In [54]:
base = alt.Chart(movieDF_since_1990).mark_bar().encode(
    x=alt.X('median(roi_dom):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=['Passes Bechdel Test', 'Women only talk about men', "Women don't talk to each other", 'Fewer than two women'], title=None)
).properties(
    title='U.S. and Canada',
    width=100,
    height=100
)

text = base.mark_text(
    align='left',
    baseline='middle',
    dx=3  
).encode(
    text=alt.Text('median(roi_dom):Q', format='.2f')
)

left_chart = base + text

left_chart
Out[54]:

checkpoint 6: remove the x-axis line and chart box, and increase the padding between bars¶

  • completed checkpoint 5
  • remove the x-axis line
  • remove the box around the figure
  • increase the padding between bars

Your chart will look like:

Hint 1: You can make use of configure_axis() and configure_view() of the overall view (base and text layer combined)

Hint 2: There are multiple ways to increase the spacing between bars

In [55]:
left_chart = (base + text).configure_axis(
    grid=False
).configure_view(
    strokeOpacity=0
).configure_bar(
    size=20  
)

left_chart
Out[55]:

checkpoint 7: create the right chart using International ROI with the same stylings as the left chart¶

  • completed checkpoint 6
  • create the right chart with the same stylings as the left chart
    • correct data
    • correct mark
    • correct encoding
    • apply correct sort order
    • no x- and y-axis labels
    • no x-axis
    • no box on chart
    • text labels with proper formatting and alignment
    • include the title for International ROI

Your chart will look like:

In [56]:
right_chart_base = alt.Chart(movieDF_since_1990).mark_bar().encode(
    x=alt.X('median(roi_int):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=['Passes Bechdel Test', 'Women only talk about men', "Women don't talk to each other", 'Fewer than two women'], title=None)
).properties(
    title='International',
    width=100,
    height=100
)

right_chart_text = right_chart_base.mark_text(
    align='left',
    baseline='middle',
    dx=3  
).encode(
    text=alt.Text('median(roi_int):Q', format='.2f')
)

right_chart = (right_chart_base + right_chart_text).configure_axis(
    grid=False
).configure_view(
    strokeOpacity=0
).configure_bar(
    size=20  
)

right_chart
Out[56]:

checkpoint 8: remove y-axis labels and change color¶

  • completed checkpoint 7
  • remove y-axis labels
  • set the bar color

Your chart will look like:

In [57]:
right_chart = (right_chart_base + right_chart_text).configure_axis(
    grid=False,
    labels=False 
).configure_view(
    strokeOpacity=0
).configure_bar(
    size=20  
).configure_mark(
    color='orange'
)

right_chart
Out[57]:

checkpoint 9: combine the two charts together¶

  • display both completed charts side by side

Your chart will look like:

Hint: You will need to move your view and axis configurations to the overall combined chart!

In [58]:
# Base chart for U.S. and Canada
base = alt.Chart(movieDF_since_1990).mark_bar(color='steelblue').encode(
    x=alt.X('median(roi_dom):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=[
        'Passes Bechdel Test', 
        'Women only talk about men', 
        "Women don't talk to each other", 
        'Fewer than two women'
    ], title=None)
).properties(
    title='U.S. and Canada',
    width=100,
    height=100
)

# Add text to U.S. and Canada chart
text = base.mark_text(
    align='left',
    baseline='middle',
    dx=3  
).encode(
    text=alt.Text('median(roi_dom):Q', format='.2f')
)

left_chart = base + text

# Base chart for International
right_chart_base = alt.Chart(movieDF_since_1990).mark_bar(color='orange').encode(
    x=alt.X('median(roi_int):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=[
        'Passes Bechdel Test', 
        'Women only talk about men', 
        "Women don't talk to each other", 
        'Fewer than two women'
    ], axis = None, title=None)
).properties(
    title='International',
    width=100,
    height=100
)

# Add text to International chart
right_chart_text = right_chart_base.mark_text(
    align='left',
    baseline='middle',
    dx=3  
).encode(
    text=alt.Text('median(roi_int):Q', format='.2f')
)

right_chart = right_chart_base + right_chart_text

# Concatenate the two charts
combined_chart = alt.hconcat(
    left_chart, 
    right_chart
).configure_axis(
    grid=False,
    domain=False  # This removes the axis lines
).configure_view(
    strokeOpacity=0
).configure_bar(
    size=20
)

# Display the combined chart
combined_chart
Out[58]:

BONUS: add an overall title and add dollar symbols to text marks¶

  • complete checkpoint 9
  • add an overall title
  • add dollar signs to text labels
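
Altair text formats follow d3-format syntax, where a string such as `'$,.2f'` asks for a currency symbol, thousands separators, and two decimals. As a rough analogue for non-negative values, the sketch below reproduces that output with Python's own formatting; `dollars` is a hypothetical helper, not an Altair function.

```python
def dollars(value):
    """Approximate the d3-format string '$,.2f' for non-negative numbers."""
    # ',' adds thousands separators and '.2f' fixes two decimals;
    # Python has no '$' format flag, so the symbol is prepended by hand.
    return f"${value:,.2f}"


print(dollars(4.03391))    # $4.03
print(dollars(1234567.8))  # $1,234,567.80
```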

Your chart will look like:

In [59]:
# Base chart for U.S. and Canada
base = alt.Chart(movieDF_since_1990).mark_bar(color='steelblue').encode(
    x=alt.X('median(roi_dom):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=[
        'Passes Bechdel Test', 
        'Women only talk about men', 
        "Women don't talk to each other", 
        'Fewer than two women'
    ], title=None)
).properties(
    title='U.S. and Canada',
    width=100,
    height=100
)

# Add text to U.S. and Canada chart
text = base.mark_text(
    align='left',
    baseline='middle',
    dx=3  
).encode(
    text=alt.Text('median(roi_dom):Q', format='$,.2f')
)

left_chart = base + text

# Base chart for International
right_chart_base = alt.Chart(movieDF_since_1990).mark_bar(color='orange').encode(
    x=alt.X('median(roi_int):Q', title=None, axis=None),
    y=alt.Y('test_result:N', sort=[
        'Passes Bechdel Test', 
        'Women only talk about men', 
        "Women don't talk to each other", 
        'Fewer than two women'
    ], axis=None, title=None)
).properties(
    title='International',
    width=100,
    height=100
)

# Add text to International chart
right_chart_text = right_chart_base.mark_text(
    align='left',
    baseline='middle',
    dx=3  
).encode(
    text=alt.Text('median(roi_int):Q', format='$,.2f')
)

right_chart = right_chart_base + right_chart_text

# Concatenate the two charts
combined_chart = alt.hconcat(
    left_chart, 
    right_chart
).configure_axis(
    grid=False,
    domain=False  # This removes the axis lines
).configure_view(
    strokeOpacity=0
).configure_bar(
    size=20
).properties(
    title='Dollars Earned for Each Dollar Spent (2013 Dollars)'
)

# Display the combined chart
combined_chart
Out[59]:

Visualization 2: Replicate this visualization¶

*** Step 1: Write down your plan for the visualization (required) ***¶

  • Data Name: movieDF
  • mark type: line
  • Encoding Specification (1st chart):
    • x:year:ordinal
    • y:mean budget:quantitative
  • Encoding Specification (2nd chart):
    • x:year:ordinal
    • y:median budget:quantitative
  • Encoding Specification (3rd chart):
    • x:year:ordinal
    • y:max budget:quantitative

Compound Method (how to join these charts together?): Use vertical concatenation (alt.vconcat) to stack the charts on top of each other. Each line chart (mean, median, and max) shows a different summary of movie budgets over time, and all three put year on the x-axis for easier comparison across the charts.

Step 2: Create your chart.¶

Please use the checkpoints below to work through the problem step-by-step. You can search for the keyword "TODO" to locate cells that need your edits

Visualization 2 Checkpoints¶

checkpoint 1: line chart for average, median, and max of budget¶

You will get full points if you

  • Specify the correct mark
  • Use the correct x and y encoding
  • Plot the right data
  • Produce 3 line charts concatenated vertically
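
An aggregate shorthand such as `mean(budget_2013$):Q` summarizes the field within each group defined by the other encoded fields, which here means one value per year. The standard-library sketch below shows the same computation by hand on made-up (year, budget) pairs; none of these numbers come from the movie dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Made-up (year, budget) pairs standing in for movieDF rows.
rows = [(1990, 10.0), (1990, 30.0), (1990, 80.0), (1991, 20.0), (1991, 60.0)]

# Group the budgets by year, one group per x value.
budgets_by_year = defaultdict(list)
for year, budget in rows:
    budgets_by_year[year].append(budget)

# One mean, median, and max per year: the three lines of the chart.
summary = {
    year: {"mean": mean(values), "median": median(values), "max": max(values)}
    for year, values in sorted(budgets_by_year.items())
}
print(summary)
```

The mean and median can differ noticeably within a year (40 versus 30 for 1990 in this toy data), which is why the lab plots them as separate lines.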

Your chart will look like:

In [60]:
# Line chart for average budget
average_budget = alt.Chart(movieDF).mark_line().encode(
    x='year:O',
    y='mean(budget_2013$):Q'
).properties(
    width=500,
    height=200
)

# Line chart for median budget
median_budget = alt.Chart(movieDF).mark_line().encode(
    x='year:O',
    y='median(budget_2013$):Q'
).properties(
    width=500,
    height=200
)

# Line chart for max budget
max_budget = alt.Chart(movieDF).mark_line().encode(
    x='year:O',
    y='max(budget_2013$):Q'
).properties(
    width=500,
    height=200
)

# Concatenate the charts vertically
budget_charts = alt.vconcat(
    average_budget,
    median_budget,
    max_budget
)

budget_charts
Out[60]:

checkpoint 2: adjust width, height and color¶

Each chart should be 500x100, plotted with different colors

  • Complete checkpoint 1
  • Adjust chart width and height
  • Plot charts with different colors

Your chart will look like:

In [61]:
# Line chart for average budget
average_budget = alt.Chart(movieDF).mark_line().encode(
    x='year:O',
    y='mean(budget_2013$):Q'
).properties(
    width=500,
    height=100
)

# Line chart for median budget
median_budget = alt.Chart(movieDF).mark_line(color='grey').encode(
    x='year:O',
    y='median(budget_2013$):Q'
).properties(
    width=500,
    height=100
)

# Line chart for max budget
max_budget = alt.Chart(movieDF).mark_line(color='pink').encode(
    x='year:O',
    y='max(budget_2013$):Q'
).properties(
    width=500,
    height=100
)

# Concatenate the charts vertically
budget_charts = alt.vconcat(
    average_budget,
    median_budget,
    max_budget
)

budget_charts
Out[61]:

checkpoint 3: remove duplicated x-axis and adjust tick spacing¶

You will get full points if you

  • Complete checkpoint 2
  • Remove duplicate x-axes from top and middle figures
  • Set the bottom x-axes to have ticks every 5 years

Your chart will look like:

In [62]:
# Line chart for average budget
average_budget = alt.Chart(movieDF).mark_line().encode(
    x=alt.X('year:Q', axis=alt.Axis(labels=False, ticks=False, title=None)),  # also drop the repeated "year" title
    y='mean(budget_2013$):Q'
).properties(
    width=500,
    height=100
)

# Line chart for median budget
median_budget = alt.Chart(movieDF).mark_line(color='grey').encode(
    x=alt.X('year:Q', axis=alt.Axis(labels=False, ticks=False, title=None)),  # also drop the repeated "year" title
    y='median(budget_2013$):Q'
).properties(
    width=500,
    height=100
)

# Line chart for max budget
max_budget = alt.Chart(movieDF).mark_line(color='pink').encode(
    x=alt.X('year:Q', axis=alt.Axis(tickCount=9, format='d')),  # format='d' shows 1990 rather than 1,990
    y='max(budget_2013$):Q'
).properties(
    width=500,
    height=100
)

# Concatenate the charts vertically
budget_charts = alt.vconcat(
    average_budget,
    median_budget,
    max_budget
)

budget_charts
Out[62]:

Visualization 3: Replicate this visualization¶

*** Step 1: Write down your plan for the visualization, for all four channels (required) ***¶

  • Data Name: movieDF
  • mark type: circle (scatter plot)
  • Encoding Specification:
    • x:rating:quantitative
    • y:budget_2013$:quantitative
    • color:country_binary:nominal
    • size:roi_int:quantitative

Step 2: Create your chart.¶

Please use the checkpoints below to work through the problem step-by-step. You can search for the keyword "TODO" to locate cells that need your edits

checkpoint 1: scatter plot of IMDB rating vs budget (in 2013 dollars)¶

You will get full points if you

  • Specify the correct mark
  • Plot the right data
  • Use the correct x and y encoding

Your chart will look like:

In [63]:
scatter_plot = alt.Chart(movieDF).mark_circle().encode(
    x=alt.X('rating:Q', title='IMDB Rating'),
    y=alt.Y('budget_2013$:Q', title='Budget (2013 Dollars)')
)

scatter_plot
Out[63]:

checkpoint 2: add the color and size channels¶

  • Complete checkpoint 1
  • Add the color channel
  • Add the size channel
  • Plot the right data

Your chart will look like:

In [64]:
scatter_plot = alt.Chart(movieDF).mark_circle().encode(
    x=alt.X('rating:Q'),
    y=alt.Y('budget_2013$:Q'),
    color=alt.Color('country_binary:N'),
    size=alt.Size('roi_int:Q')
)

scatter_plot
Out[64]:

checkpoint 3: adjust the Legend for the size channel¶

  • Complete checkpoint 2
  • set the legend for the Size channel to explicitly include 0

Your chart will look like:

Hint: You will need to adjust the legend using alt.Legend() within alt.Size()

In [65]:
scatter_plot = alt.Chart(movieDF).mark_circle().encode(
    x=alt.X('rating:Q'),
    y=alt.Y('budget_2013$:Q'),
    color=alt.Color('country_binary:N'),
    size=alt.Size('roi_int:Q', 
                  legend=alt.Legend(title="ROI", values=[0, 50, 100, 200]))
)

scatter_plot
Out[65]:

checkpoint 4: adjust the Scale for the Size channel¶

  • Complete checkpoint 3
  • set the scale of the size channel to map the data onto the range 10 to 300 (this maps 0 in the data to a size of 10)

Your chart will look like:

Hint: You will need to adjust the scale of the size channel using alt.Scale()
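
To see what a range of 10 to 300 means, the standard-library sketch below applies plain linear interpolation from a data domain onto that range. The domain (0, 200) is an assumption chosen for illustration; in the chart the domain is derived from the data. `linear_scale` is a hypothetical helper, not an Altair function.

```python
def linear_scale(value, domain, output_range):
    """Linearly interpolate `value` from `domain` onto `output_range`."""
    d0, d1 = domain
    r0, r1 = output_range
    return r0 + (value - d0) / (d1 - d0) * (r1 - r0)


# Assumed domain for illustration only; the real one comes from the data.
domain = (0, 200)
size_range = (10, 300)

for roi in (0, 100, 200):
    print(roi, linear_scale(roi, domain, size_range))
```

With a domain that starts at 0, an ROI of 0 lands on the lower end of the range (10) rather than on a size of zero, so those points stay visible.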

In [66]:
scatter_plot = alt.Chart(movieDF).mark_circle().encode(
    x=alt.X('rating:Q'),
    y=alt.Y('budget_2013$:Q'),
    color=alt.Color('country_binary:N'),
    size=alt.Size('roi_int:Q', 
                  legend=alt.Legend(title="ROI", values=[0, 50, 100, 200]),  
                  scale=alt.Scale(range=[10, 300]))  
)

scatter_plot
Out[66]:

End of Lab

To submit your assignment:

  1. Please run all cells (Runtime > Run all), and make sure every cell ran without errors.
  2. Make sure you have named your .ipynb file with your uniqname: i.e., uniqname.ipynb
  3. Upload your .ipynb file to Canvas.